Data-driven Amharic-English Bilingual Lexicon Acquisition

Author

  • Saba Amsalu
Abstract

This paper describes a simple statistical language-modelling approach to bilingual lexicon acquisition from Amharic-English parallel corpora. The goal is to induce a seed translation lexicon from sentence-aligned corpora. The seed translation lexicon contains matches of Amharic lexemes to weakly inflected English words. Purely statistical measures of term distribution are used as the basis for finding correlations between terms. An authentic scoring scheme is codified based on distributional properties of words. For low-frequency terms, a two-step procedure is applied: first a rough alignment, then automatic filtering to sift the output and improve precision. Given the disparity between the languages and the small size of the corpora used, the results demonstrate the viability of the approach.
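The abstract does not say which distributional measure or scoring scheme the paper uses. As a rough illustration of the general idea only (correlating terms by how they are distributed across sentence-aligned pairs), the sketch below substitutes the Dice coefficient, a common co-occurrence measure, and a simple score threshold as the filter. The function name `dice_lexicon`, the parameter `min_score`, and the toy corpus with placeholder source tokens `a1` to `a4` are all assumptions made for this example and are not taken from the paper.

```python
from collections import Counter
from itertools import product


def dice_lexicon(pairs, min_score=0.5):
    """Map each source term to its best-scoring target term.

    Illustrative only: uses the Dice coefficient, which is an assumption,
    not the scoring scheme described in the paper.
    """
    src_freq = Counter()  # sentence pairs containing each source term
    tgt_freq = Counter()  # sentence pairs containing each target term
    co_freq = Counter()   # sentence pairs containing both terms

    for src_sent, tgt_sent in pairs:
        src_terms = set(src_sent.split())
        tgt_terms = set(tgt_sent.split())
        src_freq.update(src_terms)
        tgt_freq.update(tgt_terms)
        co_freq.update(product(src_terms, tgt_terms))

    best = {}
    for (s, t), c in co_freq.items():
        # Dice coefficient: 2 * joint count / sum of marginal counts.
        score = 2 * c / (src_freq[s] + tgt_freq[t])
        # Keep a candidate only if it passes the threshold and beats
        # the best candidate seen so far for this source term.
        if score >= min_score and score > best.get(s, ("", 0.0))[1]:
            best[s] = (t, score)
    return best


# Hypothetical toy corpus: placeholder source tokens, English targets.
pairs = [
    ("a1 a2", "house big"),
    ("a1 a3", "house small"),
    ("a4 a2", "dog big"),
    ("a4 a3", "dog small"),
]
lexicon = dice_lexicon(pairs)
for source_term in sorted(lexicon):
    print(source_term, lexicon[source_term])
```

On real data the threshold would need tuning, and rare terms would still yield unreliable scores, which is presumably why the paper treats low-frequency terms with a separate two-step procedure.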

Similar resources

Web Mining for an Amharic - English Bilingual Corpus

We present recent work aimed at constructing a bilingual corpus consisting of comparable Amharic and English news texts. The Amharic and English texts were collected from an Ethiopian news agency that publishes daily news in Amharic and English through their web page. The Amharic texts are represented using Ethiopic script and archived according to the Ethiopian calendar. The overlap between th...

Development of Myanmar-English Bilingual WordNet like Lexicon

A bilingual concept lexicon is of significance for Information Extraction (IE), Machine Translation (MT), Word Sense Disambiguation (WSD) and the like. Myanmar-English Bilingual WordNet like Lexicon (MEBWL) is developed to fulfill the requirements of Language Acquisition (LA). However, building such a lexicon is quite challenging in terms of time and cost. To overcom...

Minimal Dependency Translation: a Framework for Computer-Assisted Translation for Under-Resourced Languages

This paper introduces Minimal Dependency Translation (MDT), an ongoing project to develop a rule-based framework for the creation of rudimentary bilingual lexicon-grammars for machine translation and computer-assisted translation into and out of under-resourced languages as well as initial steps towards an implementation of MDT for English-to-Amharic translation. The basic units in MDT, called ...

Dictionary-based Amharic - English Information Retrieval

We present two approaches to the Amharic – English bilingual track in CLEF 2004. Both experiments use a dictionary-based approach to translate the Amharic queries into English bags of words, but while one approach removes non-content-bearing words from the Amharic queries based on their IDF value, the other uses a list of English stop words to perform the same task. The resulting translated (En...

Disambiguating bilingual nominal entries against WordNet

One reason why the lexical capabilities of NLP systems have remained weak is the labour-intensive nature of encoding lexical entries for the lexicon. It has been estimated that the average time needed to manually construct a lexical entry for a Machine Translation system is about 30 minutes [Neff et al. 93]. The automatic acquisition of lexical knowledge is the main field of the rese...

Publication date: 2006